Wines Exploration Data Analysis by Fabien Martin

Sources:

I felt compelled to take on the wine dataset. Having studied in Reims (Capital of Champagne), I had some (crazy) hope that the white wine dataset would somehow be about champagne. Unfortunately, it was not. Since I wanted to make some pretty charts, I decided to mix the white wine and the red wine datasets.

More information about the datasets can be found here: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt

Let’s use R to carry out our Exploratory Data Analysis and learn a bit more about this dataset and what makes a good wine!

First, let’s see what we have in the dataset:

## [1] 6497
## [1] 14

This dataset consists of 14 variables and almost 6,500 wine observations (1,599 red wines and 4,898 white wines). Let’s have a closer look at our variable names:

## 'data.frame':    6497 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ color               : chr  "white" "white" "white" "white" ...
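As a rough sketch, the combined data frame shown above could be assembled along these lines. The tiny `red` and `white` frames are made-up toy data with only three columns (an assumption for illustration); in the real analysis they would be read from the red and white wine files.

```r
# Toy stand-ins for the two source tables (made-up values, three columns only);
# in practice these would come from read.csv() on the red and white wine files.
red <- data.frame(X = 1:2, alcohol = c(9.4, 9.8), quality = c(5L, 5L))
white <- data.frame(X = 1:3, alcohol = c(8.8, 9.5, 10.1), quality = c(6L, 6L, 6L))

# Tag each table with its color before stacking, so the origin is not lost.
red$color <- "red"
white$color <- "white"

# White wines first, then red ones, matching the order of the str() output above.
mixed_wines <- rbind(white, red)

dim(mixed_wines)          # number of rows, number of columns
table(mixed_wines$color)  # wines per color
str(mixed_wines)
```

Adding the `color` column before calling `rbind()` is what lets us compare the two kinds of wine later on.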

Let’s rename some variables (because yes, we are lazy):

##        X           fixed.a         volatile.a        citric.a     
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.: 813   1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500  
##  Median :1650   Median : 7.000   Median :0.2900   Median :0.3100  
##  Mean   :2044   Mean   : 7.215   Mean   :0.3397   Mean   :0.3186  
##  3rd Qu.:3274   3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :15.900   Max.   :1.5800   Max.   :1.6600  
##    residual.s       chlorides          free.sd          total.sd    
##  Min.   : 0.600   Min.   :0.00900   Min.   :  1.00   Min.   :  6.0  
##  1st Qu.: 1.800   1st Qu.:0.03800   1st Qu.: 17.00   1st Qu.: 77.0  
##  Median : 3.000   Median :0.04700   Median : 29.00   Median :118.0  
##  Mean   : 5.443   Mean   :0.05603   Mean   : 30.53   Mean   :115.7  
##  3rd Qu.: 8.100   3rd Qu.:0.06500   3rd Qu.: 41.00   3rd Qu.:156.0  
##  Max.   :65.800   Max.   :0.61100   Max.   :289.00   Max.   :440.0  
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50  
##  Median :0.9949   Median :3.210   Median :0.5100   Median :10.30  
##  Mean   :0.9947   Mean   :3.219   Mean   :0.5313   Mean   :10.49  
##  3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30  
##  Max.   :1.0390   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality         color          
##  Min.   :3.000   Length:6497       
##  1st Qu.:5.000   Class :character  
##  Median :6.000   Mode  :character  
##  Mean   :5.818                     
##  3rd Qu.:6.000                     
##  Max.   :9.000
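One possible way to do the renaming seen in the summary above, sketched on a toy frame with made-up values. The `short_names` mapping is a hypothetical helper; only the old and new column names come from the outputs above.

```r
# Toy frame with the original long column names (values are made up).
wines <- data.frame(
  fixed.acidity = c(7.0, 6.3), volatile.acidity = c(0.27, 0.30),
  citric.acid = c(0.36, 0.34), residual.sugar = c(20.7, 1.6),
  free.sulfur.dioxide = c(45, 14), total.sulfur.dioxide = c(170, 132)
)

# Named vector: old name -> new short name, as seen in the summary above.
short_names <- c(
  fixed.acidity = "fixed.a", volatile.acidity = "volatile.a",
  citric.acid = "citric.a", residual.sugar = "residual.s",
  free.sulfur.dioxide = "free.sd", total.sulfur.dioxide = "total.sd"
)

# Replace only the columns listed in the mapping; others keep their names.
hits <- names(wines) %in% names(short_names)
names(wines)[hits] <- unname(short_names[names(wines)[hits]])

names(wines)
```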

Let’s drop the X variable, which is basically just the ID of the wine (and especially irrelevant since we bound both data frames together).

Let’s see if we have any missing values in our dataframe!

##    fixed.a volatile.a   citric.a residual.s  chlorides    free.sd 
##          0          0          0          0          0          0 
##   total.sd    density         pH  sulphates    alcohol    quality 
##          0          0          0          0          0          0 
##      color 
##          0
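A minimal sketch of these two steps on a toy frame. The values are made up, and one `NA` is planted on purpose so that the count is not all zeros, unlike in the real dataset.

```r
# Toy frame (made-up values) with an ID column and one deliberately missing value.
wines <- data.frame(
  X = 1:4,
  alcohol = c(8.8, 9.5, NA, 9.9),
  quality = c(6L, 6L, 5L, 7L)
)

# Drop the ID column by name.
wines$X <- NULL

# Count missing values per column; all zeros means nothing is missing.
missing_counts <- colSums(is.na(wines))
missing_counts
```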

No data point missing!

Let’s dive deeper into the dataset!

Univariate Plots Section

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.818   6.000   9.000

The quality roughly follows a normal distribution. It ranges from 3 to 9 with a median of 6 and a mean of 5.818.

Apparently, no wine is perfect!

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.30   10.49   11.30   14.90

The alcohol distribution is right-skewed. Alcohol ranges from 8% to 14.9% with a median of 10.30 and a mean of 10.49.

This makes sense since it is pretty rare to have wines under 8% (they exist but are hard to make) or over 16% (for tax reasons).

It would have been interesting to have the region of origin to check my stereotype: wines with less than 11% alcohol come from cool climates, and wines with more than 13% alcohol come from hot climates.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4300  0.5100  0.5313  0.6000  2.0000

The sulphates distribution is right-skewed. Sulphates range from 0.22 to 2 with a median of 0.51 and a mean of 0.5313.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.110   3.210   3.219   3.320   4.010

The pH roughly follows a normal distribution. It ranges from 2.72 to 4.01 with a median of 3.21 and a mean of 3.219. It has a few outliers.

Wine has quite a low pH compared to water (7.0), which should not be surprising.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9923  0.9949  0.9947  0.9970  1.0390

The density roughly follows a normal distribution with an extreme outlier at 1.039. It ranges from 0.9871 to 1.039 with a median of 0.9949 and a mean of 0.9947.

Very few wines seem to have a higher density than water (1.0).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     6.0    77.0   118.0   115.7   156.0   440.0

The total sulfur dioxide distribution appears to be bimodal, with modes around 20 and 120. Total sulfur dioxide ranges from 6.0 to 440.0 with a median of 118.0 and a mean of 115.7.

Interestingly, a total sulfur dioxide level above 50 ppm (or mg / dm^3) affects the taste of the wine. Will it impact quality?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   17.00   29.00   30.53   41.00  289.00

The free sulfur dioxide follows a right-skewed distribution with some extreme outliers. It ranges from 1 to 289 with a median of 29 and a mean of 30.53.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03800 0.04700 0.05603 0.06500 0.61100

The chlorides follow a right-skewed distribution with some extreme outliers. They range from 0.009 to 0.611 with a median of 0.047 and a mean of 0.05603.

A wine should not really be salty. I wonder whether the concentration of chlorides will have an effect on the quality.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   3.000   5.443   8.100  65.800

The residual sugar follows a right-skewed distribution with some extreme outliers. It ranges from 0.6 to 65.8 with a median of 3 and a mean of 5.443.

I wonder if we can find a “sweet” spot between quality and residual sugar.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2500  0.3100  0.3186  0.3900  1.6600

The citric acid roughly follows a normal distribution with some extreme outliers. It ranges from 0 to 1.66 with a median of 0.31 and a mean of 0.3186.

Citric acid is apparently good in small quantities to add freshness to wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2300  0.2900  0.3397  0.4000  1.5800

The volatile acidity follows a right-skewed distribution. It ranges from 0.08 to 1.58 with a median of 0.29 and a mean of 0.3397.

Too much volatile acid can impact the taste of the wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.400   7.000   7.215   7.700  15.900

The fixed acidity distribution is right-skewed. Fixed acidity ranges from 3.8 to 15.9 with a median of 7.0 and a mean of 7.215.
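As a crude sanity check on the skew descriptions, one can compare the mean and the median: in a right-skewed distribution the mean tends to sit above the median. The sketch below uses the quartiles, medians and means printed in the summaries above for six of the variables. Scaling the gap by the interquartile range is my own rule of thumb for making variables comparable, not a formal skewness test.

```r
# Quartiles, medians and means copied from the summary() outputs above.
stats <- data.frame(
  variable = c("alcohol", "sulphates", "pH", "density", "chlorides", "residual.s"),
  q1       = c(9.50, 0.43, 3.11, 0.9923, 0.038, 1.8),
  median   = c(10.30, 0.51, 3.21, 0.9949, 0.047, 3.0),
  mean     = c(10.49, 0.5313, 3.219, 0.9947, 0.05603, 5.443),
  q3       = c(11.30, 0.60, 3.32, 0.9970, 0.065, 8.1)
)

# Gap between mean and median, scaled by the interquartile range.
stats$gap <- (stats$mean - stats$median) / (stats$q3 - stats$q1)

# Largest positive gaps first: these are the candidates for right skew.
stats[order(-stats$gap), c("variable", "gap")]
```

Residual sugar and chlorides come out with the largest gaps, while pH and density stay close to zero, which is in line with the histograms described above.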

Univariate Analysis

What is the structure of your dataset?

There are 6497 wines in the dataset with 13 features (fixed.a, volatile.a, citric.a, residual.s, chlorides, free.sd, total.sd, density, pH, sulphates, alcohol, quality, color).

Other observations:

  • There are more white wines than red wines.
  • The median quality is 6 and there is no wine with a quality of 10.
  • The range of alcohol in wine is quite wide.
  • Most wines have very little sugar (under 6 g / dm^3).

What is/are the main feature(s) of interest in your dataset?

The main features in the dataset are quality and alcohol. I would like to determine which features are the best for predicting the quality of a wine. I believe that some other features might have an impact on the quality of the wine. I also wonder if white and red wines have different quality profiles.

What other features in the dataset do you think will help support your

pH, volatile acidity, citric acid, residual sugar, chlorides and total sulfur dioxide. I think the pH and the residual sugar would have the most effect on the wine quality (hint: I was wrong!).

Did you create any new variables from existing variables in the dataset?

I want to create a new variable, rating, with 3 categories:

  • Quality of 4 or under -> “Poor”
  • Quality of 5 or 6 -> “Average”
  • Quality of 7 or over -> “Good”

We will then have 14 variables in our dataset.
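A minimal sketch of how such a rating variable could be derived with `cut()`. The `quality` vector below is toy data covering the observed range, and the break points are my reading of the three categories described above.

```r
# Toy quality scores covering the full observed range (3 to 9).
quality <- c(3L, 4L, 5L, 6L, 7L, 8L, 9L)

# Right-closed intervals: (-Inf, 4] -> Poor, (4, 6] -> Average, (6, Inf] -> Good.
rating <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("Poor", "Average", "Good")
)

table(rating)
```

Because the scores are integers, the break at 4 puts a quality of 4 in “Poor” and the break at 6 puts a quality of 7 in “Good”.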

There are fewer bad wines than good wines.

Of the features you investigated, were there any unusual distributions?

Quite a few (7) of the distributions are right-skewed:

  • alcohol
  • sulphates
  • free sulfur dioxide
  • chlorides
  • residual sugar
  • volatile acidity
  • fixed acidity

Most (10) of the distributions have outliers:

  • sulphates
  • pH
  • density
  • total sulfur dioxide
  • free sulfur dioxide
  • chlorides
  • residual sugar
  • citric acid
  • volatile acidity
  • fixed acidity

One distribution is bimodal:

  • total sulfur dioxide

I did not make any changes to the distributions for the univariate analysis.

Bivariate Plots Section

Correlation matrix

This is a very interesting chart:

Relationships related to quality I want to explore:

  • Quality is positively correlated with alcohol.
  • Quality is slightly negatively correlated with density, volatile acidity and chlorides.

Additional relationships not related to quality I want to explore:

  • Alcohol is strongly negatively correlated with density.
  • Density is strongly positively correlated with residual sugar and fixed acidity.

Other notable relationships:

  • Free sulfur dioxide is positively correlated with total sulfur dioxide and residual sugar.
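The numbers behind a correlation matrix like this one can be obtained with `cor()` on the numeric columns. Here is a small sketch on toy data (made-up values, three numeric columns only); how the chart itself was drawn is not shown here.

```r
# Toy data (made-up values); color is a character column and must be excluded.
wines <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 11.0, 12.4, 12.8),
  density = c(1.0010, 0.9970, 0.9951, 0.9938, 0.9912, 0.9903),
  quality = c(5L, 5L, 6L, 6L, 7L, 8L),
  color   = c("white", "white", "white", "red", "red", "red")
)

# Keep numeric columns only, then compute pairwise Pearson correlations.
numeric_cols <- sapply(wines, is.numeric)
corr <- cor(wines[, numeric_cols])

round(corr, 2)
```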

Let’s look at these relationships in more detail.

## mixed_wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.625  10.150  10.215  11.000  12.600 
## -------------------------------------------------------- 
## mixed_wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.00   10.18   10.90   13.50 
## -------------------------------------------------------- 
## mixed_wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.300   9.600   9.838  10.300  14.900 
## -------------------------------------------------------- 
## mixed_wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.60   10.50   10.59   11.40   14.00 
## -------------------------------------------------------- 
## mixed_wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.62   11.40   11.39   12.30   14.20 
## -------------------------------------------------------- 
## mixed_wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.68   12.60   14.00 
## -------------------------------------------------------- 
## mixed_wines$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90
## 
##  Pearson's product-moment correlation
## 
## data:  mixed_wines$alcohol and mixed_wines$quality
## t = 39.97, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4245892 0.4636261
## sample estimates:
##       cor 
## 0.4443185

The boxplot provides us with an interesting insight:

  • Low quality wines in general have less alcohol than their better ranked counterparts.

This is confirmed by a Pearson’s correlation coefficient of 0.4443185, which shows a moderately strong positive correlation.

This confirms that alcohol is meaningfully associated with the quality of the wine.
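The grouped summaries and the Pearson test printed above have the shape of `by()` and `cor.test()` output. A minimal sketch on toy data (made-up values standing in for `mixed_wines`):

```r
# Toy data (made-up values) standing in for mixed_wines.
mixed_wines <- data.frame(
  alcohol = c(9.0, 9.4, 9.8, 10.5, 11.2, 11.9, 12.3, 12.8),
  quality = c(4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L)
)

# One summary() of alcohol per quality score.
by(mixed_wines$alcohol, mixed_wines$quality, summary)

# Pearson correlation with its confidence interval and p-value.
result <- cor.test(mixed_wines$alcohol, mixed_wines$quality)
result$estimate
result$conf.int
```

Keep in mind that a significant correlation only tells us the two variables move together; it does not by itself show that alcohol makes a wine better.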

This visualization shows even more clearly that the good wines have a higher alcohol concentration than average and bad wines.

Not surprisingly, the wine’s color does not influence the quality of the wine.

It is interesting to notice that only white wines got a quality score of 9. However, that might be due to the difference in size of the two datasets.

We still need to verify whether good red wines share the same characteristics as good white wines.

## 
##  Pearson's product-moment correlation
## 
## data:  mixed_wines$density and mixed_wines$quality
## t = -25.89, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3277372 -0.2836508
## sample estimates:
##        cor 
## -0.3058579

The boxplot provides us with an interesting insight:

  • Low quality wines in general have a higher density than their better ranked counterparts.

This is confirmed by a Pearson’s correlation coefficient of -0.3058579, which shows a moderate negative correlation.

This confirms that density is meaningfully associated with the quality of the wine.

## 
##  Pearson's product-moment correlation
## 
## data:  mixed_wines$volatile.a and mixed_wines$quality
## t = -22.212, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2881545 -0.2429524
## sample estimates:
##        cor 
## -0.2656995

The boxplot provides us with an interesting insight:

  • Low quality wines in general have a higher volatile acidity than their better ranked counterparts.

This is confirmed by a Pearson’s correlation coefficient of -0.2656995, which shows a slight negative correlation.

This confirms that volatile acidity is associated with the quality of the wine.

## 
##  Pearson's product-moment correlation
## 
## data:  mixed_wines$chlorides and mixed_wines$quality
## t = -16.508, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2238898 -0.1772134
## sample estimates:
##        cor 
## -0.2006655

The boxplot provides us with an interesting insight:

  • Low quality wines in general have higher chlorides than their better ranked counterparts.

This is confirmed by a Pearson’s correlation coefficient of -0.2006655, which shows a slight negative correlation.

This confirms that chlorides are associated with the quality of the wine.

## 
##  Pearson's product-moment correlation
## 
## data:  mixed_wines$alcohol and mixed_wines$density
## t = -76.14, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6993829 -0.6736787
## sample estimates:
##        cor 
## -0.6867454

The scatter plot provides us with an interesting insight:

  • The more alcohol, the lower the density.

This is confirmed by a Pearson’s correlation coefficient of -0.6867454, which shows a strong negative correlation.

This is interesting because both alcohol and density are correlated with the wine’s quality, but they are also heavily correlated with each other. Putting both of them in a model to predict the wine’s quality would probably be problematic.

## 
##  Pearson's product-moment correlation
## 
## data:  mixed_wines$residual.s and mixed_wines$density
## t = 53.423, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5353934 0.5691865
## sample estimates:
##      cor 
## 0.552517

The scatter plot provides us with interesting insights:

  • The more residual sugar, the higher the density.
  • There seem to be two different trends in the data.

The first point is confirmed by a Pearson’s correlation coefficient of 0.552517, which shows a strong positive correlation.

## 
##  Pearson's product-moment correlation
## 
## data:  mixed_wines$fixed.a and mixed_wines$density
## t = 41.626, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4394976 0.4778939
## sample estimates:
##     cor 
## 0.45891

The scatter plot provides us with an interesting insight:

  • The more fixed acidity, the higher the density, even though a lot of the data is still concentrated around a fixed acidity of 6 to 8.

This is confirmed by a Pearson’s correlation coefficient of 0.45891, which shows a moderately strong positive correlation.

Correlation Matrix for white wines

Correlation Matrix for red wines

The two correlation matrices show very interesting insights:

  • Both wine colors have some similarities but also some differences.
  • The quality of white and red wines is strongly positively correlated with the alcohol variable and negatively correlated with volatile acidity.
  • White wines’ quality is more strongly correlated with density and chlorides.
  • Red wines’ quality is more strongly correlated with sulphates and citric acid.

That led me to believe that if we would like to create a model to predict the quality, we would need to separate red and white wines. Even though the color is not correlated with quality, it influences how much the other variables matter for quality.
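To compare the two colors side by side, the correlations with quality can be computed separately for each color. This is a sketch on toy data (made-up values, two features only) that uses `split()` and `sapply()`; it is one possible way to do it, not necessarily how the matrices above were produced.

```r
# Toy data (made-up values) with a color column, standing in for mixed_wines.
mixed_wines <- data.frame(
  alcohol    = c(9.0, 10.2, 11.5, 12.6, 9.4, 10.0, 11.1, 12.0),
  volatile.a = c(0.30, 0.28, 0.22, 0.25, 0.70, 0.62, 0.48, 0.40),
  quality    = c(5L, 6L, 6L, 7L, 5L, 5L, 6L, 7L),
  color      = rep(c("white", "red"), each = 4)
)

# Split by color, then correlate every numeric feature with quality in each part.
by_color <- split(mixed_wines, mixed_wines$color)
quality_cor <- sapply(by_color, function(d) {
  features <- d[, setdiff(names(d), c("quality", "color"))]
  cor(features, d$quality)[, 1]
})

# Rows are features, columns are colors.
round(quality_cor, 2)
```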

Bivariate Analysis

Talk about some of the relationships you observed in this part of the

I think the triple relationship between quality, alcohol and density is interesting since the three variables are correlated. We will need to get rid of either alcohol or density if we want to build a meaningful wine quality prediction model in the future.

Based on the correlation coefficients, I think it would make more sense to continue with the alcohol variable rather than the density variable.

Volatile acidity and chlorides are both associated with the quality score.

It’s interesting to see that the color of the wine does not have an impact on the quality but has an indirect impact on which variables are correlated with the quality.

For red wines, sulphates and citric acid are more important, whereas for white wines it is density and chlorides.

Maybe looking in more detail at the newly created rating variable would be interesting in the future.

Did you observe any interesting relationships between the other features

Apart from the strong correlation between density and alcohol, density is correlated with several variables.

What was the strongest relationship you found?

The strongest relationship I found is the one between density and alcohol, with a correlation of -0.6867454.

Multivariate Plots Section

Red and White Wines

Volatile acidity, which was correlated with quality on the whole dataset in our previous exploration, actually reveals a stark difference between white and red wines.

Red wines mostly have a volatile acidity above 0.4 whereas white wines mostly have a volatile acidity under 0.4.

As earlier results hinted, the variables most correlated with quality might differ between the whole dataset, red wines and white wines.

Let’s explore volatile acidity further by differentiating between white and red wines.

## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 21 rows containing missing values (geom_point).

The scatter plots provide us with interesting insights:

  • The volatile acidity of good white wines is between 0.10 and 0.60 and is combined with alcohol above 11.
  • The volatile acidity of good red wines is between 0.20 and 0.75 and is combined with alcohol above 10.
  • The volatile acidity of average white wines is lower than the volatile acidity of red wines.
  • Both white and red average wines have a lower alcohol level.
  • Bad wines usually have a lower alcohol concentration.

Even though volatile acidity is correlated with quality for the total dataset, it seems to have a different effect depending on the wine color.

## 
##  Pearson's product-moment correlation
## 
## data:  subset(mixed_wines, color == "white")$volatile.a and subset(mixed_wines, color == "white")$quality
## t = -13.891, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2215214 -0.1676307
## sample estimates:
##       cor 
## -0.194723
## 
##  Pearson's product-moment correlation
## 
## data:  subset(mixed_wines, color == "red")$volatile.a and subset(mixed_wines, color == "red")$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

We can see that the correlation between quality and volatile acidity differs markedly between white and red wines. Red wine’s quality is moderately negatively correlated with volatile acidity (-0.3905578) whereas white wine’s quality is only weakly correlated with volatile acidity (-0.194723).

Let’s have a look at chlorides, the last variable that is correlated with quality for the whole dataset.

## Warning: Removed 58 rows containing missing values (geom_point).

Once again we can see that there is a clear separation between red and white wines. We will probably have different correlations depending on the wine color.

Anyway, the scatter plot gave us several interesting insights:

  • Red wines have a higher chlorides concentration (0.07 to 0.11) than white wines (0.04 to 0.07).
  • White wines have a larger range of alcohol (8.5 to 13.5) than red wines (9 to 13).

## Warning: Removed 17 rows containing missing values (geom_point).
## Warning: Removed 41 rows containing missing values (geom_point).

The scatter plots again give us a very interesting insight:

  • Good white wines have lower chlorides (0.02 to 0.05) and stronger alcohol (11 to 14) than good red wines (0.05 to 0.125 for chlorides and 11 to 13 for alcohol).

## 
##  Pearson's product-moment correlation
## 
## data:  subset(mixed_wines, color == "white")$chlorides and subset(mixed_wines, color == "white")$quality
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2365501 -0.1830039
## sample estimates:
##        cor 
## -0.2099344
## 
##  Pearson's product-moment correlation
## 
## data:  subset(mixed_wines, color == "red")$chlorides and subset(mixed_wines, color == "red")$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066

Once again the correlation with quality is very different for white and red wines.

White wine’s quality is moderately negatively correlated with chlorides (-0.2099344) whereas red wine’s quality is only slightly correlated with chlorides (-0.1289066).

It would be difficult to make a general rule for good quality wine without separating it by color.

Even alcohol, which is strongly correlated with quality for both red and white wines, would be problematic since we noticed that red wines usually have less alcohol than their white counterparts. Using a model on the whole dataset without separating by color would lead to red wines being misclassified due to their lower alcohol concentration.

White Wines

White wines’ highest correlations with quality, according to our correlation matrix, were:

  • Alcohol
  • Density (that we decided not to use due to its strong correlation with the alcohol variable)
  • Chlorides
  • Volatile acidity

Unsurprisingly these are the variables that we tested on the whole dataset.

Why unsurprising? Because there are more white wines than red wines in our dataset, and that definitely tilts our correlation calculations in favor of the features correlated with white wines’ quality.

Anyway, what makes a good white wine?

  • Alcohol above 11.
  • Volatile acidity between 0.10 and 0.60.
  • Chlorides between 0.02 and 0.05.

Let’s investigate red wines further now.

Red Wines

## 
##  Pearson's product-moment correlation
## 
## data:  subset(mixed_wines, color == "red")$sulphates and subset(mixed_wines, color == "red")$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

This scatter plot gives us very interesting insights:

  • Good red wines have higher sulphates (0.6) at a low alcohol level (under 11) than average red wines.
  • Good red wines can have lower sulphates (0.5) at a higher alcohol level (above 11) but they still have more than average.

Red wines’ quality is moderately positively correlated with sulphates (0.2513971).

For reference, white wine quality is only loosely correlated with sulphates (0.05367788).

## 
##  Pearson's product-moment correlation
## 
## data:  subset(mixed_wines, color == "red")$citric.a and subset(mixed_wines, color == "red")$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

The scatter plot gave us a very interesting insight:

  • Good quality red wines usually have higher citric acid (0.25) and alcohol (10) than average or bad red wines.

This is confirmed by the moderately positive correlation (0.2263725) between quality and citric acid.

So, what makes a good red wine?

  • Alcohol above 10.
  • Sulphates above 0.5.
  • Citric acid above 0.25.
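To make the two rules of thumb concrete, they can be written down as a small check. `looks_good()` is a hypothetical helper of my own and the example wines are made up; only the thresholds come from the ranges listed above. It restates what the plots showed and is not a fitted model.

```r
# Hypothetical helper (not from the original analysis): TRUE when a wine falls
# inside the ranges listed above for its color.
looks_good <- function(color, alcohol, volatile.a = NA, chlorides = NA,
                       sulphates = NA, citric.a = NA) {
  stopifnot(color %in% c("white", "red"))
  if (color == "white") {
    # White rule: alcohol, volatile acidity and chlorides.
    alcohol > 11 && volatile.a >= 0.10 && volatile.a <= 0.60 &&
      chlorides >= 0.02 && chlorides <= 0.05
  } else {
    # Red rule: alcohol, sulphates and citric acid.
    alcohol > 10 && sulphates > 0.5 && citric.a > 0.25
  }
}

# Two made-up wines: the white one fits its rule, the red one has too little alcohol.
white_ok <- looks_good("white", alcohol = 12.0, volatile.a = 0.25, chlorides = 0.03)
red_ok <- looks_good("red", alcohol = 9.5, sulphates = 0.70, citric.a = 0.40)
c(white = white_ok, red = red_ok)
```

Note that the two colors need different features altogether, which is another reason to treat them separately.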

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The color variable we created seems to be one of the most important variables, given the difference between the two kinds of wine.

I understand better why Udacity decided to keep these two datasets separated for this project.

Alcohol is strongly associated with quality for both white and red wines.

Final Plots and Summary

Plot One

Description One

I really like this plot because it shows the distribution of quality and alcohol amongst red and white wines. It shows that the alcohol variable is tied to the quality of the wine for both white and red wines: better quality wines have higher alcohol. It also shows that there are more white wines than red wines and that the range of alcohol for white wines is usually larger.

Plot Two

Description Two

I really like this chart because it is the first one that drew my attention to the strong difference between white and red wines. We can see that white wines have a lower volatile acidity (between 0.1 and 0.5 g / dm^3) whereas red wines have a higher one (between 0.35 and 0.8 g / dm^3). I also think that the alcohol range difference is clearer here than in the previous graph. It made me rethink my approach to the whole dataset: I decided not to consider all the wines together but rather to re-split them into white and red wines for subsequent analysis.

Plot Three

## Warning: Removed 2 rows containing missing values (geom_point).

## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 21 rows containing missing values (geom_point).

Description Three

I really like this plot because it confirms what the previous chart hinted: good white and good red wines are different. On this plot we can see that the clusters for good and average wines are really different for white and red wines. Good white wines cluster between 11% and 14% alcohol by volume and between 0.10 and 0.60 g / dm^3 of volatile acidity. In comparison, the good red wine cluster is between 10% and 13% alcohol by volume and between 0.25 and 0.75 g / dm^3 of volatile acidity.


Reflection

First, I understand why the two datasets have been separated into two projects. It is more straightforward to analyze only one part of the dataset.

I struggled a bit at the beginning, mostly because I felt something wasn’t right with the results obtained with the full dataset. This became apparent when I started plotting multivariate plots. There is a (not so surprising) big difference between red and white wines.

I think it was really interesting to learn R independently after the lecture and to experiment with different visualizations.

Let’s keep in mind that these conclusions on what makes a good wine are based on a limited number of observations (6,497 in total) and that the quality of a wine may vary based on culture, geography and personal taste!

As such, for future exploration I think it would be interesting to have the geographical origin of the wine, more wine tasters and the country of origin of each taster. It would be fun to explore the dataset and uncover the different tastes of different regions. One might even see if tasters from a specific geographical area actually have a preference for local wines.

Safe drinking, enjoy in moderation!